Interconnection network topology for large scale high performance computing (HPC) systems

ABSTRACT

A multiprocessor computer system includes a plurality of processor nodes and at least a three-tier hierarchical network interconnecting the processor nodes. The hierarchical network includes a plurality of routers interconnected such that each router is connected to a subset of the plurality of processor nodes; the plurality of routers are arranged in a hierarchy of n ≥ 3 tiers (T₁, . . . , T_(n)); the plurality of routers are partitioned into disjoint groups at the first tier T₁, the groups at tier T_(i) being partitioned into disjoint groups (of complete T_(i) groups) at the next tier T_(i+1), and a top tier T_(n) including a single group containing all of the plurality of routers; and for all tiers 1 ≤ i ≤ n, each tier-T_(i−1) subgroup within a tier T_(i) group is connected by at least one link to all other tier-T_(i−1) subgroups within the same tier T_(i) group.

BACKGROUND OF THE INVENTION

The present invention relates generally to data processing and, in particular, to an improved interconnection network topology for large scale, high performance computing (HPC) systems.

Scalable, cost-effective, and high performance interconnection networks are a prerequisite for large scale HPC systems. The dragonfly topology, described, for example, in US 2010/0049942, is a two-tier hierarchical interconnection network topology. At the first tier, a number of routers are connected in a group to form a large virtual router, with each router providing one or more ports to connect to other groups. At the second tier, multiple such groups of routers are connected such that the groups form a complete graph (full mesh), with each group having at least one link to every other group.

The main motivation for a dragonfly topology is that it effectively leverages large-radix routers to create a topology that scales to very high node counts with a low diameter of just three hops, while providing high bisection bandwidth. Moreover, the dragonfly minimizes the number of expensive long optical links, which provides a clear cost advantage over fat tree topologies, which require more long links to scale to similar-size networks.

However, when considering exascale systems, fat tree and two-tier dragonfly topologies run into scaling limits. Assuming a per-node peak compute capacity R_(n)=10 TFLOP/s, an exascale system would require N=100,000 nodes. A non-blocking fat tree network with N end nodes built from routers with r ports requires n=1+log(N/r)/log(r/2) levels (with n rounded up to the next integer); therefore, using current InfiniBand routers with r=36 ports, this system scale requires a network with n=4 levels, which amounts to 2n−1=7 router ports per end node and (2n−1)/r=0.19 routers per end node. To achieve this scale in just three levels, routers with a radix r=74 are needed, which corresponds to 0.068 routers per node.

A balanced two-tier dragonfly network (p, a, h)=(12, 26, 12), i.e., one providing a theoretical throughput bound of 100% under uniform traffic, can also scale to about 100,000 nodes, where p is the "bristling factor" indicating the number of terminals connected to each router, a is the number of routers in each group, and h is the number of channels in each router used to connect to other groups. This corresponds to 1/12=0.083 routers per node and 49/12≈4.1 ports per node, which is significantly more cost-effective than the four-level fat tree, and about on par with the three-level fat tree, which requires much larger routers.

BRIEF SUMMARY

The present disclosure appreciates that as HPC systems scale to ever increasing node counts, integrating the router on the CPU chip would improve total system cost, density, and power, which are all important aspects of HPC systems. However, this next step in CPU integration reverses the industry trend, in the sense that the practical radix of such on-chip routers is much smaller than what has been previously predicted in the art. Consequently, it would be useful and desirable to deploy direct interconnection networks, such as dragonfly networks, that scale to high node counts and arbitrary numbers of tiers, while reducing router radices to a level that supports commercially practical integration of the routers into the CPU chips.

In at least some embodiments, a multiprocessor computer system includes a plurality of processor nodes and at least a three-tier hierarchical network interconnecting the processor nodes. The hierarchical network includes a plurality of routers interconnected such that each router is connected to a subset of the plurality of processor nodes; the plurality of routers are arranged in a hierarchy of n ≥ 3 tiers (T₁, . . . , T_(n)); the plurality of routers are partitioned into disjoint groups at the first tier T₁, the groups at tier T_(i) being partitioned into disjoint groups (of complete T_(i) groups) at the next tier T_(i+1), and a top tier T_(n) including a single group containing all of the plurality of routers; and for all tiers 1 ≤ i ≤ n, each tier-T_(i−1) subgroup within a tier T_(i) group is connected by at least one link to all other tier-T_(i−1) subgroups within the same tier T_(i) group.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 illustrates a conventional interconnection network having a two-tier dragonfly network topology;

FIG. 2 depicts an exemplary data processing system including an interconnection network having the smallest possible three-tier generalized dragonfly (GDF) topology;

FIG. 3 provides a table listing the number of hops per tier in a two-tier dragonfly topology;

FIG. 4 provides a table listing the number of hops at each tier in a three-tier GDF topology;

FIG. 5 illustrates an example of a multi-tier dragonfly interconnection network that shows the recursive nature of minimal path routing in a multi-tier dragonfly topology;

FIG. 6A depicts a deadlock-free virtual channel assignment policy for minimal routing in the two-tier GDF topology;

FIG. 6B illustrates a deadlock-free virtual channel assignment policy for minimal routing in the three-tier GDF topology;

FIG. 6C depicts a deadlock-free virtual channel assignment policy for minimal routing in the four-tier GDF topology;

FIG. 7 illustrates an example of an extended GDF (XGDF) topology with a bundling factor of two;

FIG. 8 provides a table summarizing the number of hops at each tier of an XGDF topology;

FIG. 9 provides a table summarizing nearly balanced three-tier XGDF topologies;

FIG. 10 depicts a block diagram of an exemplary processing node that, in accordance with a preferred embodiment, includes an integrated router; and

FIG. 11 illustrates an exemplary design flow that may be applied to a processing node having an integrated router.

DETAILED DESCRIPTION

As noted above, dragonfly topologies are highly scalable direct networks with a good cost-performance ratio and are one of the principal options for future exascale machines. Dragonfly networks are hierarchical networks, and in principle, at each level of the hierarchy, a different connection pattern could be chosen. Most prior art dragonfly networks have adopted the fully connected mesh at each level of the hierarchy, but others have employed alternative connectivity patterns, such as a flattened butterfly. Connection patterns other than the fully-connected mesh increase scalability at the cost of longer shortest paths and/or an increased number of virtual channels required to avoid deadlock.

With reference now to FIG. 1, an example of a conventional interconnection network 100 employing a two-tier dragonfly topology is illustrated. In the prior art, a dragonfly network is fully specified by three parameters: p, the bristling factor indicating the number of processing nodes (terminals) per router; a, the number of routers per group; and h, the number of ports per router to connect to other groups. As there are a routers per group and each router has h ports to connect to other groups, there are G=ah+1 groups, S=a(ah+1) routers, and N=pa(ah+1) end nodes. For the purpose of generalizing the dragonfly topology to more than the two tiers disclosed by the prior art, the conventional dragonfly topology notation can be extended to specify the number of peer ports per router as follows: DF(p, h₁, h₂), where h₁=a−1 and h₂=h, such that G=(h₁+1)h₂+1 and S=(h₁+1)((h₁+1)h₂+1). Thus, employing this new notation, the interconnection network 100 can be specified as DF(p, 3, 1), which has G=5 groups (designated G0-G4) and S=20 routers, each designated by a duple including the group number and router number. (FIG. 1 omits illustration of the separate processing nodes coupled to each router to avoid obscuring the topology.)

In accordance with the present disclosure, the prior art dragonfly topology can be generalized to an arbitrary number of tiers n. Using the enhanced notation introduced supra, a generalized dragonfly (GDF) topology is specified by GDF(p; h), where p again serves as the bristling factor and h is a vector h=(h₁, . . . , h_(n)) representing the number of peer ports per router for each tier. To provide scaling at each tier, for each i between 1 and n inclusive, h_(i) is preferably greater than or equal to 1.

G_(i) is further defined as the number of tier i−1 groups that constitute a fully connected topology at tier i. For convenience, G₀ can be utilized to indicate the individual routers, which can each be considered as "groups" at tier 0:

$G_0 = 1,\quad G_1 = h_1 + 1,\quad G_2 = G_1 h_2 + 1,\quad G_3 = G_1 G_2 h_3 + 1,\quad \ldots,\quad G_n = \left( \prod_{j=0}^{n-1} G_j \right) h_n + 1$

The total number of routers and nodes scales very rapidly as a function of the parameters h_(i), as the total number of routers S_(i) at tier i is given by the product of the group sizes up to and including tier i:

$S_i = \prod_{j=1}^{i} G_j,$

where the total number of routers equals S_(n) and the total number of nodes N equals pS_(n).
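
The recurrences above are easy to evaluate directly. The following Python sketch is purely illustrative (the function name and the choice of p=1 are assumptions, not part of the disclosure) and computes the group counts, the router count S_(n), and the node count N for a given GDF(p; h):

```python
from math import prod

def gdf_sizes(p, h):
    """Group counts (G_1..G_n), router count S_n, and node count N of GDF(p; h)."""
    G = [1]                          # G_0 = 1: each router is its own tier-0 "group"
    for h_i in h:
        G.append(prod(G) * h_i + 1)  # G_i = (G_0 * ... * G_{i-1}) * h_i + 1
    S_n = prod(G[1:])                # total routers: product of the group sizes
    return G[1:], S_n, p * S_n

print(gdf_sizes(1, (1, 1, 1)))       # ([2, 3, 7], 42, 42)
```

For GDF(p; 1, 1, 1) this reproduces the G₃=7 groups and S₃=42 routers of the topology depicted in FIG. 2.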

Each router in the topology can be uniquely identified by an n-value coordinate vector g=(g_(n), g_(n−1), . . . , g_(i), . . . , g₁), with 0 ≤ g_(i) < G_(i), with each coordinate indicating the relative group position at each tier of the hierarchy.

Referring now to FIG. 2, there is depicted an exemplary data processing system 200 including an interconnection network having the smallest possible three-tier dragonfly topology. In the illustrated topology, given by GDF(p; 1, 1, 1), there are G₃=7 groups at tier 3 and S₃=42 routers in total. Processing nodes (which are preferably integrated into a common substrate with the router as described below with reference to FIG. 10) are again omitted to avoid obscuring the interconnection pattern. It should also be noted that the routers in each successive tier 2 group are right shifted by one position to unclutter the figure.

It should be appreciated that the routers within a GDF topology are subject to a number of different interconnection patterns. In one exemplary implementation, a convenient interconnection pattern is employed in which a particular router s^(a) with coordinates g^(a)=(g_(n)^(a), g_(n−1)^(a), . . . , g_(i)^(a), . . . , g₁^(a)) is connected at tier i to h_(i) different peer routers s^(b)(x), 0 ≤ x < h_(i), with coordinates g^(b)(x)=(g_(n)^(b)(x), g_(n−1)^(b)(x), . . . , g_(i)^(b)(x), . . . , g₁^(b)(x)) according to the following pattern:

for 0 ≤ x < h_(i):

$g_j^b(x) = \begin{cases} g_j^a, & i < j \leq n \\ \left( g_j^a + 1 + \Delta_i^a \cdot h_i + x \right) \bmod G_j, & j = i \\ G_j - 1 - g_j^a, & 1 \leq j < i \end{cases}$

where the global link index Γ_(i)^(a)(x)=Δ_(i)^(a)·h_(i)+x, and where Δ_(i)^(a) is the relative index of router s^(a) within its group at tier i and is given by:

$\Delta_i^a = \sum_{k=1}^{i-1} \left( g_k^a \prod_{p=1}^{k-1} G_p \right) = \sum_{k=1}^{i-1} g_k^a \cdot S_{k-1}$

This interconnection pattern allows each router s^(a) to easily determine the positions of its neighbors based on its own position g^(a), a dimension index i, and a link index x within that dimension. This knowledge can enable a router to take correct routing decisions. Although routing could be implemented using one or more routing databases, it is preferable, especially when scaling to very large node counts, to implement algorithmic routing by employing strictly topological addressing, thus enabling routing decisions by implementing mathematical operations on the address coordinates.
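
As an illustrative sketch of such algorithmic neighbor determination (the helper name and argument layout are hypothetical, not taken from the disclosure), the interconnection pattern above can be evaluated with integer arithmetic only:

```python
from math import prod

def neighbor(g_a, G, h, i, x):
    """Peer coordinates g^b(x) reached from router g_a over link x at tier i.
    g_a[j-1] holds g_j; G and h hold (G_1..G_n) and (h_1..h_n); 0 <= x < h_i."""
    S = [1]                                    # S_0 = 1
    for G_j in G[:-1]:
        S.append(S[-1] * G_j)                  # S_k = G_1 * ... * G_k
    delta = sum(g_a[k] * S[k] for k in range(i - 1))   # Delta_i^a
    g_b = list(g_a)
    for j in range(1, i):                      # tiers 1 <= j < i: mirrored position
        g_b[j - 1] = G[j - 1] - 1 - g_a[j - 1]
    g_b[i - 1] = (g_a[i - 1] + 1 + delta * h[i - 1] + x) % G[i - 1]   # tier i
    return g_b                                 # coordinates above tier i unchanged

# DF(p, 3, 1) of FIG. 1 has G = (4, 5); the tier-2 peer of router (0, 0):
print(neighbor([0, 0], (4, 5), (3, 1), 2, 0))  # [3, 1]
```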

From the foregoing equations, it is clear that GDF interconnection network topologies scale to extremely large node counts even with small h_(i) values utilizing just a few tiers. Given a fixed router radix r excluding the end-node-facing ports, the values of h_(i) can be selected to maximize the total number of routers S_(n), subject to the constraint Σ_(i=1)^(n) h_(i)=r. For example, in the two-tier case, the total number of routers equals S₂(h₁, h₂)=h₁²h₂+2h₁h₂+h₁+h₂+1. Substituting h₂=r−h₁ in the relation yields S₂(h₁)=−h₁³+(r−2)h₁²+2rh₁+r+1. Differentiating this relation with respect to h₁ yields

$\frac{\partial S_2}{\partial h_1} = -3h_1^2 + (2r-4)h_1 + 2r.$

Setting the derivative to zero and solving for h₁ finally yields:

$h_1^{opt} = \tfrac{1}{3}\left( r - 2 + \sqrt{r^2 + 2r + 4} \right),$

which for large r can be approximated by

$\frac{h_1^{opt}}{2} = \frac{h_2^{opt}}{1} = \frac{1}{3} r.$

Because each h_(i) must be an integer value, the real maximum can be determined by evaluating the integer combinations around the exact mathematical maximum. In the three-tier case, the total number of routers equals

$S_3(h_1, h_2, h_3) = \left( ((h_1+1)h_2+1)(h_1+1)h_3 + 1 \right) \cdot ((h_1+1)h_2+1)(h_1+1),$

which for large h₁ scales as $\tilde{S}_3(h_1, h_2, h_3) = h_1^4 h_2^2 h_3$. Substituting h₃=r−h₁−h₂ yields

$\tilde{S}_3(h_1, h_2) = r h_1^4 h_2^2 - h_1^5 h_2^2 - h_1^4 h_2^3.$

Applying partial differentiation of $\tilde{S}_3(h_1, h_2)$ with respect to h₁ and h₂ gives

$\frac{\partial \tilde{S}_3}{\partial h_1} = h_1^3 h_2^2 (4r - 5h_1 - 4h_2)$ and $\frac{\partial \tilde{S}_3}{\partial h_2} = h_1^4 h_2 (2r - 2h_1 - 3h_2).$

Setting each of these partial derivatives to zero yields h₁=4(r−h₂)/5 and h₁=(2r−3h₂)/2, which after simple manipulations results in

$\frac{h_1^{opt}}{4} = \frac{h_2^{opt}}{2} = \frac{h_3^{opt}}{1} = \frac{1}{7} r,$

meaning that for large r the optimal ratio h₁:h₂:h₃ equals 4:2:1.

Using a similar analysis, it can be shown that for a four-tier dragonfly topology, where the total number of routers S₄ scales as h₁⁸h₂⁴h₃²h₄, the optimal choice for the parameters h_(i) is as follows:

$\frac{h_1^{opt}}{8} = \frac{h_2^{opt}}{4} = \frac{h_3^{opt}}{2} = \frac{h_4^{opt}}{1} = \frac{1}{15} r.$

Although these values are approximations, combinations of integer values that are close to these approximations and sum up to r indeed provide the best scaling.
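
These optima can be cross-checked by exhaustive search. The sketch below is illustrative only (the radix r=14 is an arbitrarily chosen example); it enumerates all integer splits of r and reports the one maximizing S_(n), which lands on the 4:2:1 ratio derived above:

```python
from itertools import product
from math import prod

def total_routers(h):
    """S_n of GDF(p; h); independent of p."""
    G = [1]
    for h_i in h:
        G.append(prod(G) * h_i + 1)
    return prod(G[1:])

def best_split(r, n):
    """Integer split h_1 + ... + h_n = r (h_i >= 1) maximizing S_n."""
    best, best_S = None, 0
    for head in product(range(1, r), repeat=n - 1):
        h_n = r - sum(head)
        if h_n < 1:
            continue
        S = total_routers(head + (h_n,))
        if S > best_S:
            best, best_S = head + (h_n,), S
    return best, best_S

print(best_split(14, 3))   # ((8, 4, 2), 222111): exactly the 4:2:1 ratio
```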

The bisection B (expressed in the number of links) of a two-tier dragonfly DF(p, h₁, h₂) equals the minimum number of links that need to be cut to separate the network into two equal-sized halves. For balanced networks, this equals the worst-case cut between groups. As each group is connected to all groups in the other half by exactly one link, where the number of groups G=(h₁+1)h₂+1, the bisection is given as follows:

$B = \begin{cases} G^2/2, & G \bmod 2 = 0 \\ (G-1)^2/2 + (h_1+1)^2/2, & G \bmod 2 = 1 \end{cases}$

Note that each link is counted twice because the links are bidirectional.

For a generalized dragonfly GDF(p; h), a similar result is obtained. The interconnection network is separated into two halves at the top tier, and the number of global links between the halves is counted. As there are G_(n) groups at the top tier, if G_(n) is even, B_(n)=G_(n)²/2. The generalized expression for odd G_(n) is more involved, but for large networks B_(n)=G_(n)²/2 is a reasonable approximation.

The relative bisection per node B_(n)/N can be expressed as follows:

$B_n/N = \frac{G_n^2/2}{N} = \frac{G_n^2}{2p \prod_{j=1}^{n} G_j} = \frac{G_n}{2p \prod_{j=1}^{n-1} G_j} = \frac{\prod_{j=1}^{n-1} G_j \cdot h_n + 1}{2p \prod_{j=1}^{n-1} G_j} = \frac{1}{2p \prod_{j=1}^{n-1} G_j} + \frac{h_n}{2p}$

Thus, B_(n)/N > h_(n)/2p. Therefore, to ensure full bisection bandwidth, h_(n) ≥ 2p. Note that this relation depends only on the top-tier h_(n) and the bristling factor p, but not on the number of tiers n nor on any of the lower-tier values h_(i<n). However, to ensure that the lower tiers do not impose a bottleneck, the number of times each type of link is traversed must be considered. For shortest-path (direct) routing, most paths traverse 2^(n−i) links at tier i, where 1 ≤ i < n. Therefore, the values for h_(i) at each tier should satisfy h_(i)=2h_(i+1), where 1 ≤ i < n. Thus, h_(i)/h_(n)=2^(n−i) for 1 ≤ i < n, or h_(i)=2^(n−i+1)p for 1 ≤ i < n.
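
A quick numeric check of this bound (a sketch with illustrative parameters; at h_(n)=2p exactly, the relative bisection sits just above the h_(n)/2p lower bound, consistent with the expression above):

```python
from math import prod

def relative_bisection(p, h):
    """B_n/N of GDF(p; h); exact when the top-tier group count G_n is even."""
    G = [1]
    for h_i in h:
        G.append(prod(G) * h_i + 1)
    G = G[1:]
    B_n = G[-1] ** 2 / 2            # top-tier cut (links counted twice, as above)
    return B_n / (p * prod(G))

p, h = 3, (12, 6)                    # h_2 = 2p: full-bisection boundary case
print(relative_bisection(p, h), h[-1] / (2 * p))   # 1.0128... vs. bound 1.0
```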

The average distance d_(avg) (i.e., the number of inter-router hops) in a two-tier GDF(p; h₁, h₂) follows from the path count breakdown listed in Table I shown in FIG. 3. A hop on a link belonging to tier i is denoted by l_(i).

Hence, the average distance in a two-tier dragonfly network is given by:

$d_{avg}^{2t} = \frac{3(h_1^2 h_2) + 2(2 h_1 h_2) + 1(h_1 + h_2)}{h_1^2 h_2 + 2 h_1 h_2 + h_1 + h_2 + 1}.$

Similarly, the average distance in a three-tier dragonfly network is expressed as:

$d_{avg}^{3t} = \frac{7(h_1^4 h_2^2 h_3) + 6(4 h_1^3 h_2^2 h_3) + 5(2 h_1^3 h_2 h_3 + 6 h_1^2 h_2^2 h_3) + 4(6 h_1^2 h_2 h_3 + 4 h_1 h_2^2 h_3) + 3(h_1^2 h_2 + h_1^2 h_3 + h_2^2 h_3 + 6 h_1 h_2 h_3) + 2(h_1 h_2 + 2 h_1 h_3 + 2 h_2 h_3) + 1(h_1 + h_2 + h_3)}{h_1^4 h_2^2 h_3 + 4 h_1^3 h_2^2 h_3 + 2 h_1^3 h_2 h_3 + 6 h_1^2 h_2^2 h_3 + 6 h_1^2 h_2 h_3 + 4 h_1 h_2^2 h_3 + h_1^2 h_2 + h_1^2 h_3 + h_2^2 h_3 + 6 h_1 h_2 h_3 + 2 h_1 h_2 + 2 h_1 h_3 + 2 h_2 h_3 + h_1 + h_2 + h_3 + 1}$

As discussed above, satisfying h_(i)=2^(n−i+1)p for 1 ≤ i < n provides full bisection bandwidth. However, the notion of a balanced dragonfly network requires only half bisection bandwidth, because a balanced dragonfly network assumes uniform traffic. For uniform traffic, only half of the traffic crosses the bisection. Correspondingly, the h_(i)/p values for a balanced multi-tier dragonfly network are halved: h_(i)=2^(n−i)p, for 1 ≤ i < n. Note that these ratios between the h_(i) values are also the optimal ratios between h_(i) values to achieve the maximum total system size.

Formally, a GDF(p; (h₁, . . . , h_(n))) is referred to as balanced if for all i, 1 ≤ i < n, h_(i)/h_(i+1)=H_(i)/H_(i+1), where H_(i) represents the total number of tier-i hops across all shortest paths from a given node to every other node, which is equivalent to the hop-count distribution under uniform traffic. The balance ratios β_(i,j) are defined as β_(i,j)=h_(i)H_(j)/(h_(j)H_(i)). A network is perfectly balanced when for all i, 1 ≤ i < n: β_(i,i+1)=1. In practice, achieving ratios exactly equal to one may not be possible, so it is preferable if the ratios are as close to 1 as possible. If β_(i,i+1)>1, the network (at tier i) is overdimensioned, whereas the network is underdimensioned if β_(i,i+1)<1. It should be noted that as employed herein the term "balance" relates specifically to bandwidth, not router port counts; however, as equal bandwidth on all links is assumed herein (unless specifically stated otherwise), router port count and link bandwidth can appropriately be viewed as equivalent concepts.

The relation h_(i)=2^(n−i)p, for 1 ≤ i < n, for balanced dragonfly networks is sufficient but conservative, because, depending on h, a certain fraction of minimal paths are shorter than 2^(n)−1 hops. More precisely, for a two-tier dragonfly network, the ratio between the number of l₁ versus l₂ hops equals

$\frac{H_1}{H_2} = \frac{2 h_1^2 h_2 + 2 h_1 h_2 + h_1}{h_1^2 h_2 + 2 h_1 h_2 + h_2}.$

As a consequence, for small first-tier groups, the effective balance factor is clearly smaller than 2. The exact ratio between h₁ and h₂ for a balanced network can be determined by solving β_(1,2)=1. In particular, since h₁=h₂H₁/H₂, h₁ = h₂ − 1 + √(h₂²+1) ≈ 2h₂ − 1.
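
This closed form can be verified numerically; the sketch below (illustrative; it uses the hop-count expression as reconstructed above) confirms that h₁=h₂−1+√(h₂²+1) makes β_(1,2)=1, i.e., dividing h₁ by H₁/H₂ returns exactly h₂:

```python
from math import sqrt

def hop_ratio(h1, h2):
    """H_1/H_2 for a two-tier GDF, per the expression above."""
    H1 = 2 * h1**2 * h2 + 2 * h1 * h2 + h1
    H2 = h1**2 * h2 + 2 * h1 * h2 + h2
    return H1 / H2

h2 = 12
h1 = h2 - 1 + sqrt(h2**2 + 1)        # ~= 2*h2 - 1 = 23.04...
print(h1 / hop_ratio(h1, h2))        # 12.0, i.e. beta_{1,2} = 1
```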

For three-tier balance, Table II shown in FIG. 4 lists all the different path types, all paths of a given type having the same number of hops for each link type (l₁, l₂, l₃), the number of times each path type occurs, and the number of hops per link type for each path type. Note that most path types can occur in different variants (e.g., l₁−l₂ and l₂−l₁), which is accounted for in the third column.

From Table II, the total number of hops H_(i) per link type l_(i) can be obtained as follows:

$H_1 = h_1 \left( 1 + 2h_2 + 2h_1 h_2 + h_3 \left( 2 + 2h_1 + 6h_2 + 12h_1 h_2 + 6h_1^2 h_2 + 4h_2^2 + 12h_1 h_2^2 + 12h_1^2 h_2^2 + 4h_1^3 h_2^2 \right) \right)$
$H_2 = h_2 \left( 1 + 2h_1 + h_1^2 + h_3 \left( 2 + 6h_1 + 2h_2 + 6h_1^2 + 8h_1 h_2 + 12h_1^2 h_2 + 2h_1^3 + 8h_1^3 h_2 + 2h_1^4 h_2 \right) \right)$
$H_3 = h_3 \left( 1 + 2h_1 + 2h_2 + h_1^2 + 6h_1 h_2 + h_2^2 + 4h_1 h_2^2 + 6h_1^2 h_2 + 6h_1^2 h_2^2 + 2h_1^3 h_2 + 4h_1^3 h_2^2 + h_1^4 h_2^2 \right)$

From these relations, the balance ratios H₁/H₂ and H₂/H₃ can be determined. Although the value of H₁/H₂ is not entirely independent of h₃, it can be shown that the derivative

$\frac{\partial (H_1/H_2)}{\partial h_3}$ is extremely close to zero (<4e−16) for any valid combination of h₁, h₂, and h₃, implying that the two-tier balance condition for h₁ and h₂ also holds in the three-tier case. The condition for h₃ can be determined by solving h₃=h₂H₃/H₂, which after reduction yields:

$h_3 = \frac{h_2^3 + h_2}{2h_2^2 + 1} \approx \frac{h_2}{2}.$

It should further be appreciated that the condition h_(i)=2·h_(i+1) for a balanced network is satisfied by any combination h_(i)·BW_(i)=2·(h_(i+1)·BW_(i+1)), where BW_(i) is the bandwidth per port at tier i, which could be achieved by doubling BW_(i+1) rather than h_(i+1).

Substituting the conditions for a balanced dragonfly network into the expression for the total number of nodes N yields an expression for N that depends only on a single variable, allowing the network parameters p and h_(i) to be uniquely determined as a function of N. For example, for a two-tier topology, N=4h₂⁴+2h₂². Solving this equation for h₂ yields

$h^{bal\text{-}2t}(N) = \frac{1}{2}\sqrt{\sqrt{4N+1} - 1},$

which for large N is approximately equal to

$\sqrt[4]{N}/\sqrt{2}.$

Therefore, using the conditions for a balanced dragonfly topology, the balanced router radix for the two-tier case equals

$r_{bal\text{-}2t}(N) = 4h_2 - 1 = 2\sqrt{\sqrt{4N+1} - 1} - 1.$

It can similarly be shown that for the three-tier case the respective expressions are:

$N = 1024 h_3^8 + 256 h_3^6 + 48 h_3^4 + 4 h_3^2$
$h^{bal\text{-}3t}(N) = \frac{1}{4}\sqrt{\sqrt{4\sqrt{4N+1} - 3} - 1}$
$r_{bal\text{-}3t}(N) = 8h_3 - 1 = 2\sqrt{\sqrt{4\sqrt{4N+1} - 3} - 1} - 1.$
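
For illustration, evaluating these closed forms (a sketch; the numeric comments are approximate) reproduces the system sizes discussed in the Background: a balanced two-tier network of about 100,000 nodes needs a router radix near 49, while a balanced three-tier network of a million nodes needs a radix below 20:

```python
from math import sqrt

def balanced_2t(N):
    h2 = 0.5 * sqrt(sqrt(4 * N + 1) - 1)
    return h2, 4 * h2 - 1                    # (h_2, balanced radix r)

def balanced_3t(N):
    h3 = 0.25 * sqrt(sqrt(4 * sqrt(4 * N + 1) - 3) - 1)
    return h3, 8 * h3 - 1                    # (h_3, balanced radix r)

print(balanced_2t(100_000))    # h2 ~= 12.6, r ~= 49.3 (cf. DF(12, 26, 12) supra)
print(balanced_3t(1_000_000))  # h3 ~= 2.35, r ~= 17.8 (cf. "radix 20" infra)
```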

Network costs in terms of the number of routers and inter-router links per end node equal 1/p routers/node and

$\frac{p + \sum_{i=1}^{n} h_i}{p}$

links/node. For balanced networks these ratios amount to 1/h₂ and (4h₂−1)/h₂≈4 for two tiers, and 1/h₃ and (8h₃−1)/h₃≈8 for three tiers.

Turning now to routing considerations, routing packets in an interconnection network having a dragonfly topology can include both shortest-path (i.e., minimal or direct) routing, as well as non-minimal or indirect routing. Because exactly one shortest path exists between any pair of routers (or processing nodes), minimal routing is the more straightforward case. In a preferred embodiment, a dimension order routing approach is implemented in which each dimension corresponds to a tier of the dragonfly topology. The routers comprising the minimal route determine the minimal route in n phases, where at each phase, one or more of the routers in the minimal route compares its coordinates and those of the destination router and routes the packet to the correct position (group) in the current dimension (tier) before the packet is passed to the next dimension. In a GDF, routers preferably order the dimensions from most to least significant, because routing to a destination group at a certain tier i may require hops through some or all of the dimensions lower than i as well.

As an example, assuming n=3, source coordinates (g₁^(s), g₂^(s), g₃^(s)) and destination coordinates (g₁^(d), g₂^(d), g₃^(d)), the routing algorithm implemented by the routers in the dragonfly interconnection network first routes to the correct destination group at tier 3 (x, y, g₃^(d)), then within that top-tier group to the correct subgroup at tier 2 (x, g₂^(d), g₃^(d)), and finally to the correct sub-subgroup at tier 1 (g₁^(d), g₂^(d), g₃^(d)). Because of the particular structure of a GDF, in which exactly one link connects two given groups at a given tier, each of the n routing phases (one per dimension) may require intermediate hops to get to the router that connects the current group to the destination group.

With reference now to FIG. 5, an example of a multi-tier dragonfly interconnection network 500 is illustrated that shows the recursive nature of minimal path routing in a multi-tier dragonfly topology. In this example, interconnection network 500 includes a source processing node 502 a, a first destination processing node 502 b and a second destination processing node 502 c. Processing nodes 502 a-502 c are interconnected by a plurality of routers 504 interconnected in a dragonfly interconnection network including a plurality of first tier groups 506 including a first tier group 506 a that includes a router 504 a to which source processing node 502 a is directly connected, a first tier group 506 b that includes a router 504 b to which first destination processing node 502 b is directly connected, a first tier group 506 c, and a first tier group 506 d that includes a router 504 c to which second destination processing node 502 c is directly connected. As shown, first tier groups 506 a-506 b in turn belong to a second tier group 508 a, and first tier groups 506 c-506 d in turn belong to a second tier group 508 b.

In a generalized dragonfly topology, the diameter at tier i is denoted by d_(i). From the above discussion, it follows that the diameter at successive tiers can be derived by the simple recursive formula d_(i)=2d_(i−1)+1, 1 ≤ i ≤ n, with d₀=0. It follows that the diameter d_(n) is given by

$d_n = \sum_{i=0}^{n-1} 2^i = 2^n - 1$

In this example, the distance from source processing node 502 a to first destination processing node 502 b equals three, that is, the diameter at tier 2 (i.e., d₂) equals 3. The distance from source processing node 502 a to second destination processing node 502 c equals twice the diameter at tier 2 plus the hop connecting the two tier-2 groups 508 a-508 b. Hence, the diameter at tier 3 (i.e., d₃) equals 7.
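
The recursion is trivially evaluated (an illustrative sketch):

```python
def diameter(n):
    """d_n from d_i = 2*d_{i-1} + 1 with d_0 = 0; equals 2**n - 1."""
    d = 0
    for _ in range(n):
        d = 2 * d + 1
    return d

print([diameter(n) for n in (1, 2, 3, 4)])   # [1, 3, 7, 15]
```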

For non-minimal routing, path diversity in a GDF network is immense, and many different techniques of performing indirect routing, such as Valiant routing, are possible. In Valiant routing, a packet is first routed to a randomly selected intermediate group at the "nearest common ancestor" tier between the source and the destination processing nodes, and from there to the destination processing node. In other words, given that both source and destination processing nodes are within the same group at tier i, Valiant routing allows an indirect path to visit one intermediate group at tier i−1. For instance, if source and destination processing nodes are within the same tier 1 group, the longest indirect path is l₁→l₁ (the intermediate "group" in this case comprises just a single router). If source and destination processing nodes are within the same tier 2 group, then the longest (5-hop) indirect path is l₁→l₂→l₁→l₂→l₁, whereas if the direct path leads up to tier 3, a corresponding (11-hop) indirect path is l₁→l₂→l₁→l₃→l₁→l₂→l₁→l₃→l₁→l₂→l₁. It follows that the longest indirect path according to the Valiant routing policy has a length of 2^(n−1)+2^(n)−1=3·2^(n−1)−1 hops. A variant of Valiant routing, Valiant-Any, which can mitigate certain adverse traffic patterns, allows misrouting to any intermediate router rather than any intermediate group. In this case the longest indirect path length equals 2(2^(n)−1)=2^(n+1)−2 hops.

Shortest-path routing in dragonfly networks is inherently prone to deadlocks because shortest-path routing induces cyclic dependencies in the channel dependency graph. Without a proper deadlock avoidance policy in place, forward progress cannot be guaranteed. It has previously been demonstrated that, to guarantee deadlock freedom, the basic two-tier dragonfly requires, for the tier-1 links, two virtual channels for shortest-path routing and three virtual channels for indirect routing.

To guarantee deadlock freedom in GDF topologies with an arbitrary number of tiers, a sufficient number of virtual channels (VC) must be allocated. Referring now to FIGS. 6A-6C, there are depicted channel dependency graphs for two-, three-, and four-tier dragonfly topologies, respectively. In FIGS. 6A-6C, source and destination processing nodes are illustrated connected via circles each representing a respective class of channels identified by dimension i and by virtual channel j. Note that there are no self-dependencies, as links belonging to the same dimension can never be traversed consecutively. The dashed line in each figure indicates the VC assignment policy for a worst-case shortest path.

FIG. 6A illustrates a deadlock-free virtual channel assignment policy for minimal routing in the two-tier case. The VC identifier is incremented when moving from a hop at tier 2 (l₂) to a hop at tier 1 (l₁). This VC incrementation is sufficient to break cyclic dependencies. In the three-tier case shown in FIG. 6B, a total of four VCs on the lowest tier, two VCs on the second tier, and one VC on the third tier is necessary. (To unclutter the figure, the arrows from the various classes of channels towards the destination processing node are omitted, as each of the hops can be the final one in the path.) The four-tier case shown in FIG. 6C employs a total of eight VCs on the lowest tier, four VCs on the second tier, two VCs on the third tier and one VC on the fourth tier.

In the general case, a GDF with n tiers requires 2^(n−i) virtual channels on tier i for deadlock avoidance with shortest-path routing. This follows from the maximum number of times that hops at a given tier are visited, which is exactly 2^(n−i) for tier i for the worst-case shortest paths.

Given an incoming port corresponding to tier l₁ and channel vc₁, and an outbound port corresponding to tier l₂ and channel vc₂, with λ=|l₂−l₁| being the number of tiers crossed, the outbound virtual channel vc₂ is given by

$vc_2 = \begin{cases} \left\lfloor vc_1 / 2^{\lambda} \right\rfloor, & l_2 > l_1 \\ vc_1 \cdot 2^{\lambda} + 2^{\lambda - 1}, & l_2 < l_1 \end{cases}$

Note that assigning the next VC according to this deadlock-free general-case VC assignment policy only requires knowledge of the current VC and the current and next dimension, and can be implemented by the routers in a GDF network using simple shifting and addition operations.
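
A sketch of this policy in code (illustrative; tier and VC numbering as above), traced over the worst-case three-tier shortest path l₁→l₂→l₁→l₃→l₁→l₂→l₁, shows the tier-1 hops consuming VCs 0, 1, 2, 3, the tier-2 hops VCs 0 and 1, and the tier-3 hop VC 0, matching FIG. 6B:

```python
def next_vc(vc1, l1, l2):
    """Outbound VC given inbound VC vc1 and inbound/outbound tiers l1, l2."""
    lam = abs(l2 - l1)                       # lambda: number of tiers crossed
    if l2 > l1:
        return vc1 >> lam                    # floor(vc1 / 2**lambda)
    return (vc1 << lam) + (1 << (lam - 1))   # vc1 * 2**lambda + 2**(lambda-1)

vc, prev = 0, 1                              # first hop: tier 1, VC 0
for tier in (2, 1, 3, 1, 2, 1):              # remaining hops of the 7-hop path
    vc = next_vc(vc, prev, tier)
    prev = tier
    print(tier, vc)                          # (2,0) (1,1) (3,0) (1,2) (2,1) (1,3)
```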

The indirect routing paths described above require additional virtual channels for deadlock-free operation, as the additional available routes can create additional cycles in the channel dependency graphs. For Valiant routing, the maximum number of hops at a given tier i of such an indirect path equals 2^(n−i−1)+2^(n−i) (for n−i ≥ 1; at tier n there are two hops at maximum). Correspondingly, each link at tier i should provide at least 2^(n−i−1)+2^(n−i) virtual channels. For the tier-1 links, this policy implies three channels for n=2 and six for n=3.

Consequently, the virtual channel assignment policy set forth above changes. An indirect path is composed of an initial Valiant component from the source processing node to the selected intermediate destination, followed by the shortest path from the intermediate destination to the actual destination processing node. The numbers of tier-i hops of the first and second components are at most 2^(n−i−1) and 2^(n−i), respectively. For the second component, the VC assignment policy set forth above is applied, with VCs numbered from 0 to 2^(n−i)−1 for each tier as before. For the first component, which is roughly half a shortest path, the shortest-path assignment is also applied, except that the VCs are numbered from 2^(n−i) to 2^(n−i)+2^(n−i−1)−1. Therefore, prior to applying the VC assignment policy set forth above, a value of 2^(n−i) is subtracted, and added back in afterwards. To correctly compute the VC, the VC assignment policy must therefore be aware of whether the routing phase is Valiant (first part) or not (second part).

The present disclosure further appreciates that the GDF topology described previously, which employs only a single link between each pair of groups at a given tier, can be extended to employ multiple links between groups. This arrangement gives rise to a new class of interconnection networks referred to herein as extended generalized dragonfly (XGDF) topologies. XGDF networks cover the entire spectrum of network topologies from fully-scaled dragonflies to Hamming graphs, in which each router is directly connected with its peers in each of the other groups at each tier. Note that XGDF topologies can all be built using different configurations of the same "building block" router.

An extended generalized dragonfly is specified by XGDF(p; h; b). In addition to p and h=(h₁, . . . , h_(n)), which have the same meaning as defined for the base GDF topology, a vector of bundling values b=(b₁, . . . , b_(n)) is specified, indicating the number of links between each pair of groups at each tier i. Note that the case b_(i)=1 for all i corresponds to the base GDF topology with a single link between each pair of groups.

FIG. 7 illustrates an example of an XGDF topology with a bundling factor b₂=2. This topology has ((h₁/b₁+1)h₂)/b₂+1=7 groups of 4 routers each, where b_(i) must be a divisor of the number of tier-i links emanating from each tier-(i−1) group. Alternatively, one could choose the bundling factor b₂ equal to 3 or 6. With b₂=3, the resulting topology would have five groups, whereas with b₂=6 the topology would have three groups.

To obtain a completely regular topology, the constraint is imposed that, at each tier i, the number of outbound links from a tier-(i−1) group must be an integer multiple of b_(i), so that each bundle has exactly the same number of subgroups and links:

$\forall i: \left( h_i \prod_{j=1}^{i-1} G_j' \right) \bmod b_i = 0$

Moreover, it is assumed that b₁=1, because there are no subgroups at the first tier and from a topological perspective it is not useful to introduce multiple links between a pair of routers. (Although such redundant links can be present for non-topological reasons.)

The number of groups at each tier G_(i)′ is given by the following equations:

$G_1' = \frac{h_1}{b_1} + 1,\quad G_2' = \frac{G_1' h_2}{b_2} + 1,\quad G_3' = \frac{G_1' G_2' h_3}{b_3} + 1,\quad \ldots,\quad G_n' = \frac{\left( \prod_{j=1}^{n-1} G_j' \right) h_n}{b_n} + 1$

The total number of switches S_(i)′ in one of the groups at tier i is given by the product of the group sizes up to tier i:

$S_i' = \prod_{j=1}^{i} G_j'$

The total number of nodes N′ in the system equals N′=pS′_(n). Each switch in the topology can be uniquely identified by an n-value coordinate vector (g₁, g₂, . . . , g_(n)), with 0 ≤ g_(i) < G′_(i), with each coordinate indicating the relative group position at each tier of the hierarchy.
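
The XGDF sizing rules, including the regularity constraint on b_(i), can be sketched as follows (illustrative code; the parameter values used for FIG. 7 are inferred, with p=1 chosen arbitrarily):

```python
from math import prod

def xgdf_sizes(p, h, b):
    """(G'_1..G'_n, S'_n, N') for XGDF(p; h; b); rejects irregular bundlings."""
    G = []
    for h_i, b_i in zip(h, b):
        links = h_i * prod(G)          # outbound links per tier-(i-1) group
        if links % b_i:
            raise ValueError("b_i must divide the outbound link count")
        G.append(links // b_i + 1)
    S = prod(G)
    return G, S, p * S

# FIG. 7 (7 groups of 4 routers, b_2 = 2), assuming h = (3, 3) and p = 1:
print(xgdf_sizes(1, (3, 3), (1, 2)))   # ([4, 7], 28, 28)
```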

“Full bundling” at tier i is defined herein as the case where

$b_i = \prod_{j=1}^{i-1} G_j'$

such that G_(i)′=h_(i)+1. This relation implies that every router has a direct link to its respective peers in each other group at tier i. If all tiers have full bundling, the topology is equivalent to a prior art HyperX (i.e., Hamming graph) topology. Consequently, herein it is assumed that an XGDF does not have full bundling at all tiers. It should be noted that the number of groups at each tier, and therefore the total number of routers and end nodes, decreases with increasing values of b_(i). By virtue of the XGDF definition, the switch radix is independent of the b_(i) values, r=p+Σ_(i=1)^(n) h_(i). Therefore, bundling trades off network scale for average distance, and a bundled network requires a larger router radix than one without bundling in order to scale to the same size.

The interconnection pattern for an arbitrary XGDF is the same as that for a GDF topology, with one minor modification to Γ_(i)^(a)(x):

$\Gamma_i^a(x) = \left( \Delta_i^a \cdot h_i + x \right) \bmod \left( G_i' - 1 \right)$

to account for the fact that there are now more global links than the number of (other) remote groups G_(i)′−1.

As every group is connected to every other group by b_(n) links, the number of links that cross the bisection at the top tier equals B′_(n)=(G′_(n))²b_(n)/2 (assuming G′_(n) is even). Hence, the relative bisection per node can be given as:

$\frac{B_n'}{N'} = \frac{(G_n')^2 b_n}{2p \prod_{j=1}^{n} G_j'} = \frac{G_n' b_n}{2p \prod_{j=1}^{n-1} G_j'} = \frac{\left( \frac{\prod_{j=1}^{n-1} G_j' \cdot h_n}{b_n} + 1 \right) b_n}{2p \prod_{j=1}^{n-1} G_j'} = \frac{b_n}{2p \prod_{j=1}^{n-1} G_j'} + \frac{h_n}{2p} > \frac{h_n}{2p}$

The significance of this result is that the top-tier bisection does not depend on b_(n), but only on h_(n). Similarly, the bisection bandwidths at lower tiers also do not depend on b. In other words, the relative bisection of an XGDF is not affected by the bundling factors, but is determined fully by h.

The average distance in a two-tier XGDF(p; h₁, h₂; 1, b₂) is given by

$d_{avg}^{2t} = \frac{3\left( x^2 \frac{h_2}{b_2} \right) + 2\left( \left( \frac{h_1}{b_1} + x \right) h_2 \right) + 1\left( \frac{h_1}{b_1} + h_2 \right)}{\frac{h_1^2 h_2}{b_1^2 b_2} + 2\frac{h_1 h_2}{b_1 b_2} + \frac{h_1}{b_1} + \frac{h_2}{b_2} + 1}$

Here, the correction factor x accounts for the fact that, unlike in GDFs without bundling, there may be multiple shortest paths to a given node.

Routing in an XGDF is in principle the same as in a GDF, as groups are traversed in a hierarchical manner. The longest direct routes are identical to those for the GDF without bundling. However, in the case where full bundling is implemented at a certain tier, the route at that tier "degenerates" in the sense that no hops at a lower tier are necessary to reach the destination group, because each switch has a direct link to all its peers. Even in a full-scale non-bundled dragonfly topology, a certain fraction of the routes is (partially) degenerate. An example of this in a two-tier dragonfly topology is the four routes l₁, l₂, l₁→l₂, and l₂→l₁. As the bundling factor increases, the relative fraction of such routes will increase, because there are more switches at a given tier that have a link to a given destination group at that tier.

In the extreme case of the Hamming graph, all shortest paths have exactly n hops vs. 2^(n)−1 for the full-scale dragonfly topology. Naturally, this also has implications for network balance and deadlock avoidance. Deadlocks disappear entirely in Hamming graphs as long as a strict dimension-order routing policy is applied, so multiple virtual channels would not be required for direct, shortest-path routing. A second virtual channel is required, however, for indirect routing.

As there are multiple paths between peer groups in a bundled network, at each tier a specific link must be selected. In a preferred embodiment, routers in an XGDF topology implement a shortest-path routing scheme that load balances the traffic equally across the available paths. For example, routers at every tier may select a link from a bundle utilizing a hashing function on packet ID, source, and destination. This load-balancing is preferably implemented when the source router is not connected to the destination group and the destination router is not connected to the source group, to ensure the shortest-path property. This shortest-path routing algorithm induces an imbalanced usage of local links, however. To illustrate this point, consider a simple case with n=2 and b₂>1, XGDF(p; h₁, h₂; 1, b₂). In such a system, a given router with coordinates (x₁, x₂) is one of b₂ local routers all connecting to each specific remote group to which this router is also connected. The local links of the designated router can be classified into two types: Type 1 links connect to a local router that does not connect to the same remote groups, and Type 2 links connect to a local router that does connect to the same remote groups. Assuming uniform random traffic (without self-traffic), the relative load on each remote link arriving at the router equals ρ=G₁/(b₂G₂).

Given a traffic arrival intensity of μ, the total load on local links of Types 1 and 2 equals

$\lambda_{T1} = \mu p \left( \frac{1}{G_1 G_2} + \frac{\rho h_2}{G_1} + \frac{h_2}{b_2 G_2} \right)$
$\lambda_{T2} = \mu p \left( \frac{1}{G_1 G_2} + \frac{\rho h_2}{G_1} + 0 \right)$

where the first terms correspond to local traffic, the second terms to traffic arriving from remote links, and the third terms to traffic with a destination in a remote group other than the one to which the designated router connects. Note that the third terms are the ones that cause the imbalance: because of the shortest-path requirement, a router may not load-balance remote traffic to remote groups to which it itself has a link.

From the above, it follows that the load ratio between the two types of local links equals

$\frac{\lambda_{T1}}{\lambda_{T2}} = \frac{1 + 2\frac{G_1 h_2}{b_2}}{1 + \frac{G_1 h_2}{b_2}},$

which for large values of the ratio (G₁h₂)/b₂ approaches two. This ratio represents the worst-case imbalance, which only applies to networks in which the third term is either zero or maximum. This third term, which represents traffic generated at the local source router with a destination in a remote group, depends on how many remote groups the local source router and the local intermediate router have in common. This overlap Ω(l) can range from 0 to h₂. If Ω(l)=0 then the total load for Type 1 links applies, and if Ω(l)=h₂ then the total load for Type 2 links applies. As the overlap may be different for every local link, the load on links can be generalized as follows:

$\lambda_l = \mu p \left( \frac{1}{G_1 G_2} + \frac{\rho h_2}{G_1} + \frac{h_2}{b_2 G_2} \left( 1 - \frac{\Omega(l)}{h_2} \right) \right)$

To determine the maximum load λ_(l) across all local links l of a given router, the minimum overlap Ω_(min)=min_(l)(Ω(l)) must be determined. If the number of remote groups G₁h₂/b₂ is at least twice the number of remote links h₂, i.e., G₁h₂/b₂ ≥ 2h₂, then there is at least one link for which Ω(l) equals zero, namely the link to the next local router, and therefore Ω_(min)=0, such that the generalized link load equation reduces to that of the Type 1 links. This condition can be simplified to G₁/b₂ ≥ 2. If b₂=G₁, the tier implements full bundling, in which all local routers have links to all remote groups, i.e., Ω(l)=h₂ for all l and therefore Ω_(min)=h₂, such that the link load reverts to that of Type 2 links as each path contains only one local hop. For the intermediate range 1<G₁/b₂<2 it can be shown that

$\Omega_{min} = 2h_2 - \frac{G_1 h_2}{b_2} = h_2 \left( 2 - \frac{G_1}{b_2} \right)$

and, taking into account the other cases above, for G₁/b₂ ≥ 1

$\Omega_{min} = \max\left( 0, h_2 \left( 2 - \frac{G_1}{b_2} \right) \right).$

This expression is required to compute the correct topology values for a balanced XGDF.
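
The generalized load equation and Ω_(min) can be combined into a short sketch (illustrative parameter values only), which also exhibits the worst-case load ratio approaching two:

```python
def local_link_load(mu, p, h1, h2, b2, omega):
    """lambda_l for a two-tier XGDF (b_1 = 1) and overlap omega in [0, h2]."""
    G1 = h1 + 1
    G2 = G1 * h2 // b2 + 1
    rho = G1 / (b2 * G2)               # load per arriving remote link
    return mu * p * (1 / (G1 * G2) + rho * h2 / G1
                     + (h2 / (b2 * G2)) * (1 - omega / h2))

def omega_min(h1, h2, b2):
    return max(0.0, h2 * (2 - (h1 + 1) / b2))

mu, p, h1, h2, b2 = 1.0, 3, 6, 3, 1
t1 = local_link_load(mu, p, h1, h2, b2, 0)    # Type 1: no overlap
t2 = local_link_load(mu, p, h1, h2, b2, h2)   # Type 2: full overlap
print(t1 / t2)                                # 43/22 ~= 1.95, approaching 2
```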

From the foregoing discussions on network bisection and shortest path routing, it follows that the conditions for a balanced multi-tier dragonfly network set forth above are sufficient for a balanced XGDF network. However, higher values for b_(i) also increase the number of "degenerated" minimal paths, thus reducing average minimal path length. Therefore, the usage ratios of lower-tier to higher-tier links also decrease. Table III depicted in FIG. 8 analyzes this effect for two-tier XGDF networks. The correction factor x is given by the relation, supra.

For an XGDF network of two tiers, the corresponding balance ratio equals

$\frac{H_1}{H_2} = \frac{2x^2 \frac{h_2}{b_2} + x h_2 + \frac{h_1}{b_1} h_2 + \frac{h_1}{b_1}}{x^2 \frac{h_2}{b_2} + x h_2 + \frac{h_1}{b_1} h_2 + h_2}$

The balance ratio quickly decreases as a function of b₂. The extreme case of b₂=2h₂+1 (x=0) corresponds to a HyperX topology; correspondingly, the balance ratio equals (h₁h₂+h₁)/(h₁h₂+h₂) at that point, which is close to 1 for large h₁, h₂. To find a balanced ratio between h₁ and h₂, a solution for h₁=h₂H₁/H₂ is computed. This equation has an intricate closed-form solution, for which a good approximation is given by h₁≈2h₂−b₂, so the tier-1 group size can be decreased relative to h₂ as b₂ increases.

Based on the foregoing discussion regarding link loading, the condition for p can be derived as follows. The utilization of the higher-utilized "Type 1" tier-1 links is given by the equation set forth above. To prevent these links from becoming a bottleneck, the following relation must hold:

$p \leq \frac{1}{\frac{1}{G_1 G_2} + \frac{h_2}{b_2 G_2} + \frac{h_2}{b_2 G_2} \left( 1 - \frac{\Omega_{min}}{h_2} \right)},$

which yields

$p \leq \frac{G_1 G_2}{1 + \frac{G_1 h_2}{b_2} \left( 2 - \frac{\Omega_{min}}{h_2} \right)}.$

This can be approximated by

$p \leq \begin{cases} h_2 - \frac{b_2}{2}, & \text{for } b_2 \leq \frac{2h_2}{3} \\ b_2, & \text{for } b_2 > \frac{2h_2}{3}, \end{cases}$

which implies that in the extreme cases of the fully-scaled dragonfly (b₂=1) and the Hamming graph (b₂=h₁+1) the balanced bristling factor equals h₂, but for intermediate values of b₂, it can be as low as 2h₂/3.
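
This piecewise bound is captured by the following sketch (illustrative), which shows the balanced bristling factor dipping to its 2h₂/3 minimum at b₂=2h₂/3 before rising again toward the Hamming-graph regime:

```python
def p_max(h2, b2):
    """Approximate balanced bristling factor for a two-tier XGDF."""
    return h2 - b2 / 2 if b2 <= 2 * h2 / 3 else b2

print([p_max(12, b2) for b2 in (1, 4, 8, 12)])   # [11.5, 10.0, 8.0, 12]
```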

For XGDF topologies having three tiers, bundling is too complex to be readily analyzed in closed form. Therefore, an analysis was performed using a C++ simulated implementation of a generic XGDF topology and its corresponding shortest-path routing algorithm. By traversing all paths from a selected source processing node to every other destination processing node, the number of times a hop at each tier is traversed can be determined. This yields the per-tier hop counts H_(i), from which the relative balance ratios β_(1,2) and β_(2,3) can be determined.

To develop a suitable sample set, this simulation analysis was performed for 457 different three-tier XGDF topologies, subject to the constraint h₃=2, which is sufficient to scale to balanced networks with up to 444,222 nodes, i.e., XGDF(2; 8, 4, 2; 1, 1, 1). Given the upper bound on the balance conditions, every h₂∈[h₃, 2h₃] and, for every value of h₂, every h₁∈[h₂, 2h₂] was analyzed. For each of these triples (h₁, h₂, h₃), every value of b₂ such that G₁h₂ mod b₂=0 and b₂ ≤ G₁ was analyzed, and for each of those values of b₂ every value of b₃ such that G₁G₂h₃ mod b₃=0 and b₃ ≤ G₁G₂. Out of the 457 topologies analyzed, 23 are nearly optimally balanced, in the sense that they satisfy 0.96 ≤ β_(1,2), β_(2,3) < 1.1. For these 23 topologies, summarized in Table IV shown in FIG. 9, a deeper analysis was also performed to determine the maximum bristling factor p_(max) by traversing all paths that cross a given router. The maximum utilization across all tiers determines the maximum injection load per router, which can be equated with the bristling factor. (Note that strictly speaking p does not have to be an integer value; non-integer bristling factors could be achieved by running the node links at a different speed than the network links.) These values are also listed in Table IV. Note that in most cases p_(max) is close to h₃=2, but may deviate, especially for large bundling values. Using this approach, all (nearly) balanced networks for a given value of h₃ can be enumerated.

Using the foregoing approximation for the bristling factor, the number of ports per processing (end) node in the balanced two-tier case can be expressed as:

$\frac{r}{p} = \frac{p + h_1 + h_2}{p} = \begin{cases} 1 + \frac{3\frac{h_2}{b_2} - 1}{\frac{h_2}{b_2} - \frac{1}{2}}, & \text{for } b_2 \leq \frac{2h_2}{3} \\ 3\frac{h_2}{b_2}, & \text{for } b_2 > \frac{2h_2}{3} \end{cases}$

for b₂ ≤ h₁+1. The number of routers per end node equals 1/p; hence

$\frac{1}{p} = \begin{cases} \frac{1}{h_2 - \frac{b_2}{2}}, & \text{for } b_2 \leq \frac{2h_2}{3} \\ \frac{1}{b_2}, & \text{for } b_2 > \frac{2h_2}{3} \end{cases}$

The deadlock avoidance policy described above for GDF topologies is also valid for XGDF topologies. Only in the extreme case of full bundling at all tiers does minimal dimension-order routing require just a single virtual channel, because each dimension is visited only once.

A good system balance is key to achieving sustained exaflop performance. Providing sufficient communication bandwidth will be critical to enabling a wide range of workloads to benefit from exascale performance. With current HPC interconnect technology, the byte-to-FLOP ratio will likely be orders of magnitude less than in current petascale systems, which would pose significant barriers to performance portability for many workloads, particularly communication-intensive ones. For reasons of cost and density, integrated routers are preferred for achieving acceptable byte/FLOP ratios. Based on IO pin and power arguments, the prior art two-tier dragonfly is not amenable to an integrated exascale network. Although three-tier networks actually increase the overall number of links and their associated power, the fact that a drastically lower router radix is sufficient to scale to million-node networks (radix 20 vs. radix 90 for two tiers) enables low-radix routers that are amenable to integration on the main compute node, because they require modest link IO power and pin budgets.

Referring now to FIG. 10, there is depicted a block diagram of an exemplary processing node 1000 that, in accordance with a preferred embodiment, includes at least one integrated router. In the depicted embodiment, processing node 1000 is a single integrated circuit comprising a semiconductor or insulator substrate on which integrated circuitry is formed. Processing node 1000 includes two or more processor cores 1002 a, 1002 b for processing instructions and data. In some embodiments, each processor core 1002 is capable of independently executing multiple simultaneous hardware threads of execution.

The operation of each processor core 1002 is supported by a multi-level volatile memory hierarchy having at its lowest level a shared system memory 1004 accessed via an integrated memory controller 1006, and at its upper levels, one or more levels of cache memory, which in the illustrative embodiment include a store-through level one (L1) cache within and private to each processor core 1002, and a respective store-in level two (L2) cache 1008 a, 1008 b for each processor core 1002 a, 1002 b. Although the illustrated cache hierarchy includes only two levels of cache, those skilled in the art will appreciate that alternative embodiments may include additional levels (L3, etc.) of on-chip or off-chip, private or shared, in-line or lookaside cache, which may be fully inclusive, partially inclusive, or non-inclusive of the contents of the upper levels of cache.

Processing node 1000 further includes an I/O (input/output) controller 1010 supporting the attachment of one or more I/O devices (not depicted). Processing node 1000 additionally includes a local interconnect 1012 supporting local communication among the components integrated within processing node 1000, as well as one or more integrated routers 1016 that support communication with other processing nodes 1000 or other external resources via an interconnection network having a GDF or XGDF topology as previously described. As shown, a router 1016 includes a plurality of ports 1018 and a controller 1020 that controls transfer of packets between ports 1018 and between router 1016 and local interconnect 1012 according to one or more routing policies, as previously described. Router 1016 may optionally further include one or more data structures referenced by controller 1020 in the course of making routing determinations, such as forwarding database (FDB) 1022 and routing database (RDB) 1024.

In operation, when a hardware thread under execution by a processor core 1002 of a processing node 1000 includes a memory access (e.g., load or store) instruction requesting a specified memory access operation to be performed, processor core 1002 executes the memory access instruction to determine the target address (e.g., an effective address) of the memory access request. After translation of the target address to a real address, the cache hierarchy is accessed utilizing the target address. Assuming the indicated memory access cannot be satisfied by the cache memory or system memory 1004 of the processing node 1000, router 1016 of processing node 1000 may transmit the memory access request to one or more other processing nodes 1000 of a multi-node data processing system (such as those shown in FIG. 2 or FIG. 7) for servicing. The one or more other processing nodes 1000 may respond to the memory access request by transmitting one or more data packets to the requesting processing node 1000 via the interconnection network.

With reference now to FIG. 11, there is depicted a block diagram of an exemplary design flow 1100 used, for example, in semiconductor IC logic design, simulation, test, layout, and manufacture. Design flow 1100 includes processes, machines and/or mechanisms for processing design structures or devices to generate logically or otherwise functionally equivalent representations of the design structures and/or devices described above and shown herein. The design structures processed and/or generated by design flow 1100 may be encoded on machine-readable transmission or storage media to include data and/or instructions that, when executed or otherwise processed on a data processing system, generate a logically, structurally, mechanically, or otherwise functionally equivalent representation of hardware components, circuits, devices, or systems. Machines include, but are not limited to, any machine used in an IC design process, such as designing, manufacturing, or simulating a circuit, component, device, or system. For example, machines may include: lithography machines, machines and/or equipment for generating masks (e.g., e-beam writers), computers or equipment for simulating design structures, any apparatus used in the manufacturing or test process, or any machines for programming functionally equivalent representations of the design structures into any medium (e.g., a machine for programming a programmable gate array).

Design flow 1100 may vary depending on the type of representation being designed. For example, a design flow 1100 for building an application specific IC (ASIC) may differ from a design flow 1100 for designing a standard component or from a design flow 1100 for instantiating the design into a programmable array, for example a programmable gate array (PGA) or a field programmable gate array (FPGA) offered by Altera® Inc. or Xilinx® Inc.

FIG. 11 illustrates multiple such design structures including an input design structure 1120 that is preferably processed by a design process 1110. Design structure 1120 may be a logical simulation design structure generated and processed by design process 1110 to produce a logically equivalent functional representation of a hardware device. Design structure 1120 may also or alternatively comprise data and/or program instructions that, when processed by design process 1110, generate a functional representation of the physical structure of a hardware device. Whether representing functional and/or structural design features, design structure 1120 may be generated using electronic computer-aided design (ECAD) such as implemented by a core developer/designer. When encoded on a machine-readable data transmission, gate array, or storage medium, design structure 1120 may be accessed and processed by one or more hardware and/or software modules within design process 1110 to simulate or otherwise functionally represent an electronic component, circuit, electronic or logic module, apparatus, device, or system such as those shown herein. As such, design structure 1120 may comprise files or other data structures including human and/or machine-readable source code, compiled structures, and computer-executable code structures that, when processed by a design or simulation data processing system, functionally simulate or otherwise represent circuits or other levels of hardware logic design. Such data structures may include hardware-description language (HDL) design entities or other data structures conforming to and/or compatible with lower-level HDL design languages such as Verilog and VHDL, and/or higher level design languages such as C or C++.

Design process 1110 preferably employs and incorporates hardware and/or software modules for synthesizing, translating, or otherwise processing a design/simulation functional equivalent of the components, circuits, devices, or logic structures shown herein to generate a netlist 1180 which may contain design structures such as design structure 1120. Netlist 1180 may comprise, for example, compiled or otherwise processed data structures representing a list of wires, discrete components, logic gates, control circuits, I/O devices, models, etc. that describes the connections to other elements and circuits in an integrated circuit design. Netlist 1180 may be synthesized using an iterative process in which netlist 1180 is resynthesized one or more times depending on design specifications and parameters for the device. As with other design structure types described herein, netlist 1180 may be recorded on a machine-readable storage medium or programmed into a programmable gate array. The medium may be a non-volatile storage medium such as a magnetic or optical disk drive, a programmable gate array, a compact flash, or other flash memory. Additionally, or in the alternative, the medium may be a system or cache memory, or buffer space.

Design process 1110 may include hardware and software modules for processing a variety of input data structure types including netlist 1180. Such data structure types may reside, for example, within library elements 1130 and include a set of commonly used elements, circuits, and devices, including models, layouts, and symbolic representations, for a given manufacturing technology (e.g., different technology nodes, 32 nm, 45 nm, 90 nm, etc.). The data structure types may further include design specifications 1140, characterization data 1150, verification data 1160, design rules 1170, and test data files 1185, which may include input test patterns, output test results, and other testing information. Design process 1110 may further include, for example, standard mechanical design processes such as stress analysis, thermal analysis, mechanical event simulation, process simulation for operations such as casting, molding, and die press forming, etc. One of ordinary skill in the art of mechanical design can appreciate the extent of possible mechanical design tools and applications used in design process 1110 without deviating from the scope and spirit of the invention. Design process 1110 may also include modules for performing standard circuit design processes such as timing analysis, verification, design rule checking, place and route operations, etc.

Design process 1110 employs and incorporates logic and physical design tools such as HDL compilers and simulation model build tools to process design structure 1120 together with some or all of the depicted supporting data structures, along with any additional mechanical design or data (if applicable), to generate a second design structure 1190. Design structure 1190 resides on a storage medium or programmable gate array in a data format used for the exchange of data of mechanical devices and structures (e.g., information stored in an IGES, DXF, Parasolid XT, JT, DRG, or any other suitable format for storing or rendering such mechanical design structures). Similar to design structure 1120, design structure 1190 preferably comprises one or more files, data structures, or other computer-encoded data or instructions that reside on transmission or data storage media and that, when processed by an ECAD system, generate a logically or otherwise functionally equivalent form of one or more of the embodiments of the invention shown herein. In one embodiment, design structure 1190 may comprise a compiled, executable HDL simulation model that functionally simulates the devices shown herein.

Design structure 1190 may also employ a data format used for the exchange of layout data of integrated circuits and/or symbolic data format (e.g., information stored in a GDSII (GDS2), GL1, OASIS, map files, or any other suitable format for storing such design data structures). Design structure 1190 may comprise information such as, for example, symbolic data, map files, test data files, design content files, manufacturing data, layout parameters, wires, levels of metal, vias, shapes, data for routing through the manufacturing line, and any other data required by a manufacturer or other designer/developer to produce a device or structure as described above and shown herein. Design structure 1190 may then proceed to a stage 1195 where, for example, design structure 1190: proceeds to tape-out, is released to manufacturing, is released to a mask house, is sent to another design house, is sent back to the customer, etc.

As has been described, the present disclosure formally generalizes the dragonfly topology to an arbitrary number of tiers, yielding the generalized dragonfly (GDF) topology. These networks are characterized herein in terms of their connectivity pattern, scaling properties, diameter, average distance, cost per node, direct and indirect routing policies, number of virtual channels (VCs) required for deadlock freedom, and a corresponding VC assignment policy. Moreover, specific configurations are disclosed that reflect balance, i.e., configurations that theoretically provide 100% throughput under uniform traffic. Closed-form expressions for the topological parameters of balanced two- and three-tier GDF networks are provided. Moreover, the present disclosure introduces the notion of a balanced router radix and derives closed-form expressions as a function of network size for two- and three-tier networks. The present disclosure also demonstrates that the parameters that provide a balanced network also lead to a maximum-sized network, given a fixed router radix.
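
By way of illustration only, and not as part of the disclosed embodiments, the following Python sketch sizes a GDF from the per-tier peer-link counts, using the recurrences G_(i) = S_(i−1)·h_(i) + 1 and S_(i) = S_(i−1)·G_(i) that correspond to the closed-form expressions of claim 5 below; the function and variable names are hypothetical, not drawn from the disclosure.

    # Minimal sketch, assuming the GDF recurrences of claim 5:
    #   G_i = S_{i-1} * h_i + 1  and  S_i = S_{i-1} * G_i,
    # with p terminals per router and h = (h_1, ..., h_n) peer links per
    # router at each tier. Names are illustrative only.
    def gdf_size(p, h):
        """Return (group sizes G_1..G_n, total routers S_n, total end nodes)."""
        G = []
        S = 1  # S_0 = 1: a tier-0 "subgroup" is a single router
        for h_i in h:
            G_i = S * h_i + 1  # each tier-(i-1) subgroup links once to every peer
            G.append(G_i)
            S *= G_i           # S_i = S_{i-1} * G_i
        return G, S, p * S

    # The balanced two-tier configuration (p, a, h) = (12, 26, 12) corresponds
    # to h = (25, 12) here, since a = h_1 + 1 = 26 routers per tier-1 group.
    G, routers, nodes = gdf_size(12, (25, 12))
    print(G, routers, nodes)  # [26, 313] 8138 97656, i.e., roughly 100,000 nodes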

In a second generalization, the present disclosure extends the GDF framework to encompass networks with more than one link between groups, referred to herein as the extended generalized dragonfly (XGDF). This extension recognizes that most practical installations of a given dragonfly network are much smaller than the theoretical maximum size, which could leave many links unused. Rather than leaving such links unused, the XGDF topology employs these links to provide additional bandwidth between groups. To this end, bundling factors are introduced that specify the number of links between groups at each tier. XGDFs are analyzed in terms of the same criteria as GDFs, again paying special attention to the notion of balance and quantifying the effect of bundling on network balance. In particular, it has been found that the balance between the first and second tiers depends linearly on the bundling factor at the second tier, and that the bristling factor exhibits a non-monotonic behavior as a function of the bundling factor.
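
Again purely as an illustration, under the same assumptions and naming caveats as the previous sketch, the bundling recurrence of claim 4 below can be rendered as follows; note the integer-divisibility constraint on each b_(i).

    # Minimal sketch, assuming the XGDF recurrence of claim 4:
    #   G'_i = (S'_{i-1} * h_i) / b_i + 1, with b_i parallel links joining each
    #   pair of peer tier-(i-1) subgroups; b_i must divide S'_{i-1} * h_i.
    def xgdf_size(p, h, b):
        """Return (group sizes G'_1..G'_n, total routers, total end nodes)."""
        G = []
        S = 1  # S'_0 = 1: a tier-0 subgroup is a single router
        for h_i, b_i in zip(h, b):
            links = S * h_i  # tier-i links leaving each tier-(i-1) subgroup
            if links % b_i:
                raise ValueError("b_i must divide S'_{i-1} * h_i")
            G_i = links // b_i + 1  # b_i links per peer instead of one
            G.append(G_i)
            S *= G_i
        return G, S, p * S

    # With all b_i = 1, the XGDF reduces to the GDF of the previous sketch.
    print(xgdf_size(12, (25, 12), (1, 2)))  # ([26, 157], 4082, 48984)

Consistent with the bundling discussion above, raising b₂ from 1 to 2 in this example halves the number of peer groups reachable at the second tier, shrinking the maximum network size while doubling the bandwidth between each remaining pair of groups.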

In at least one embodiment, a multiprocessor computer system includes a plurality of processor nodes and at least a three-tier hierarchical network interconnecting the processor nodes. The hierarchical network includes a plurality of routers interconnected such that each router is connected to a subset of the plurality of processor nodes; the plurality of routers are arranged in a hierarchy of n≧3 tiers (T₁, . . . , T_(n)); the plurality of routers are partitioned into disjoint groups at the first tier T₁, the groups at tier T_(i) being partitioned into disjoint groups (of complete T_(i) groups) at the next tier T_(i+1), and a top tier T_(n) including a single group containing all of the plurality of routers; and for all tiers 1≦i≦n, each tier-T_(i−1) subgroup within a tier T_(i) group is connected by at least one link to all other tier-T_(i−1) subgroups within the same tier T_(i) group.

While various embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the appended claims, and that all such alternate implementations fall within the scope of the appended claims.

What is claimed is:
 1. A multiprocessor computer system comprising: a plurality of processor nodes; and a multi-tier hierarchical network interconnecting the processor nodes, wherein the multi-tier hierarchical network includes a plurality of routers, wherein: each router is connected to a subset of the plurality of processor nodes; the plurality of routers are arranged in a hierarchy of n tiers (T₁, . . . , T_(n)) where n is at least three; the plurality of routers are partitioned into disjoint groups at a first tier T₁, groups of routers at each intermediate tier T_(i) are partitioned into disjoint groups at a next higher tier T_(i+1), and a top tier T_(n) includes a single group containing all of the plurality of routers; and for all tiers 1≦i≦n, each tier-T_(i−1) subgroup within a tier T_(i) group is connected by at least one link to all other tier-T_(i−1) subgroups within a same tier T_(i) group.
 2. The multiprocessor computer system of claim 1, wherein each group of at least one specific tier T_(i) is connected to each other group within a same tier T_(i+1) group by a plurality of links, such that multiple, but less than all, T_(i−1) routers from one T_(i) group are connected to different T_(i−1) routers in its peer T_(i) group.
 3. The multiprocessor computer system of claim 2, wherein a number of links connecting each pair of T_(i) subgroups is an integer divisor of the number of routers times the number of tier-i links per router in each T_(i) subgroup.
 4. The multiprocessor computer system of claim 3, wherein: bundling factors at tiers (T₁, . . . , T_(n)) equal (b₁, . . . , b_(n)); a number of subgroups that comprise a tier T_(i) group equals $G_{i}^{\prime} = \frac{\left( \prod\limits_{j = 1}^{i - 1} G_{j}^{\prime} \right) h_{i}}{b_{i}} + 1,$ where h_(i) is a number of peer ports per router for tier T_(i); a total number of routers S_(i)′ that comprise a tier T_(i) group equals $S_{i}^{\prime} = \prod\limits_{j = 1}^{i} G_{j}^{\prime};$ and for all i, bundling factor b_(i) is an integer divisor of S_(i−1)′·h_(i).
 5. The multiprocessor computer system of claim 1, wherein a number of links provided by each router to connect to other groups at respective tiers (T₁, T₂, . . . , T_(n)) equals (h₁, . . . , h_(n)), such that the number G_(i) of subgroups that comprise a tier T_(i) group equals $G_{i} = \left( \prod\limits_{j = 1}^{i - 1} G_{j} \right) \cdot h_{i} + 1$ and a total number S_(i) of routers that comprise a tier T_(i) group equals $S_{i} = \prod\limits_{j = 1}^{i} G_{j}.$
 6. The multiprocessor computer system of claim 1, wherein a ratio among the numbers of links per router used to connect to groups at respective tiers (T₁, T₂, . . . , T_(n)) equals (2^(n−1), 2^(n−2), . . . , 1).
 7. The multiprocessor computer system of claim 1, wherein each router provides, for each link corresponding to a connection between subgroups at tier T_(i), at least 2^(n−i) distinct virtual channels for deadlock-free shortest-path routing, for 1≦i≦n.
 8. The multiprocessor computer system of claim 7, wherein routers perform a virtual channel mapping of traffic arriving on an incoming virtual channel number vc_(x) of a link corresponding to tier T_(x) and departing on a link corresponding to outgoing tier T_(y) with an outgoing virtual channel number vc_(y), depending on index x, index y, and the incoming virtual channel number vc_(x), according to: $vc_{y} = \begin{cases} \left\lfloor vc_{x}/2^{\lambda} \right\rfloor, & y > x, \\ vc_{x} \cdot 2^{\lambda} + 2^{\lambda} - 1, & y < x, \end{cases}$ where λ=|x−y| equals the absolute difference between the incoming and outgoing tier indices.
 9. The multiprocessor computer system of claim 1, wherein each router provides, for each link corresponding to a connection between subgroups at tier T_(i), at least 2^(n−i)+2^(n−i−1) distinct virtual channels for deadlock-free indirect routing, for 1≦i<n, and at least two distinct virtual channels for tier T_(n).
 10. The multiprocessor computer system of claim 1, wherein at least one of the plurality of routers is integrated with one of the plurality of processor nodes.
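
For illustration only, the virtual channel mapping of claim 8 can be sketched as follows. The sketch assumes, per claim 7, that the VCs of a tier-T_(i) link are numbered 0 through 2^(n−i)−1; the behavior for x=y is an illustrative assumption not specified by the claim, and the function name is hypothetical.

    # Minimal sketch of the VC mapping of claim 8, assuming VCs on a tier-T_i
    # link are numbered 0 .. 2**(n - i) - 1 (claim 7). Names are illustrative.
    def map_vc(vc_x, x, y):
        """Map incoming VC vc_x on a tier-x link onto an outgoing tier-y link."""
        lam = abs(x - y)  # lambda = |x - y|
        if y > x:
            return vc_x >> lam                     # floor(vc_x / 2**lambda)
        if y < x:
            return (vc_x << lam) + (1 << lam) - 1  # vc_x * 2**lambda + 2**lambda - 1
        return vc_x  # x == y: identity, an assumption not covered by the claim

    # For n = 3, descending from the single tier-3 VC (vc_x = 0) to a tier-1
    # link yields VC 3, the highest of the 2**(3 - 1) = 4 tier-1 VCs of claim 7.
    print(map_vc(0, 3, 1))  # 3

For y < x the mapped value never exceeds 2^(n−y)−1, so the mapping stays within the per-tier VC counts required by claim 7.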